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Abstract 

A new family of compound Poisson distribution functions from sta- 
tistical linguistic is used to study the n-tuples and nucleotide composi- 
tion features of DNA sequences. The relative frequency distribution of 
the 6-tuples and 7- tuples occurrence studies suggest that most of the 
DNA sequences follow the general shape of the compound Poisson dis- 
tribution. It is also noted that the x-square test indicated that some of 
the sequences follow this distribution with a reasonable level of good- 
ness of fit. The compositional segmentation study fits quite well using 
this new family of distribution functions. Furthermore, the absolute 
values of the relative frequency come out naturally from the linguis- 
tic model without ambiguity. It is suggesting that DNA sequences are 
not random sequences and they could possibly have subsequence struc- 
tures. 

Keywords : DNA segmentation. Statistical linguistic, Compound Pois- 
son distribution, Jensen-Shannon divergence measure 

PACS numbers: 87.10.-he, 87.14.Gg 



1 Introduction 

In recent years, the subject of bioinformatics has emerged as an active re- 
search subject in biology and other fields such as physics. Researchers are be- 
ginning to search for DNA words [1,2] and build up dictionaries for genomes 
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[3]. In doing so, people are more willing to regard information stored in 
DNA sequences as a natural language from nature. A lot of these activities 
are indeed to employ statistical methods to study DNA sequences. In an 
early attempt, researchers [4] used the Zipf's law [5] to study the statistical 
features that are embedded in DNA sequences. Other researchers [8] subse- 
quently used different distributions to fit the rank of the word distributions 
in DNA sequences and obtained better fit than the original Zipf plot. The 
Zipf's law was first proposed in 1932 when George Zipf made an empiri- 
cal observation on some statistical regularities of human writings which has 
become the most prominent statement of statistical linguistic. In his orig- 
inal work, Zipf found that if the number of different words in a given text 
were arranged in the order of their frequency of usage, there would be an 
approximate mathematical relation between the frequency of occurrence of 
each word and its rank in the list of all the words used in the text that were 
ordered by decreasing frequency. He later pointed out that similar relations 
also hold in other contexts [6, 7]. 

Let us associate a particular word by an index r equal to its rank, and by 
/(r) the normalized frequency of occurrence of that word, i.e., the number 
of times it appears in the text divided by the total number of words TV. 
Zipf's law states that there is an approximate relation between /(r) and r 



where a{> 1) and A are constants. The above mathematical relation was 
used [4] to study the statistical features of DNA sequences where similar 
scaling behavior was found. It was however noted that for sequences com- 
posed of primarily coding regions, the data were well fitted by a logarithmic 
function [8]. And just like in the case of linguistic, the Zipf's law could only 
account for a limited zone of the rank variable. 

In the early days of quantitative linguistics, researchers, notably Yule 
[10] had suggested that the mathematical relation (1) proposed by Zipf was 
unsatisfactory. He conjectured that the correct distribution for word fre- 
quencies would be a compound Poisson model. However, there is no fitting 
of any mathematical distribution law to the extensive data in his book. 
Good [11] later proposed to overcome Yule's objection by introducing a 
convergence factor into the Riemann distribution which gives 



where r>l,0<^<l,a>0 and A is the normalization constant in terms 



(1) 



/(r) = Ar-'^e'' 



(2) 
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of 9 and a. Neither Good nor any other authors have fitted (2) to any real 
data except for 9 very close to one. One should also note that the primary 
reason to introduce 9 was to achieve convergence and secondary to improve 
the fit. In [10], Good also proposed a distribution function to fit two sets 
of data but were rather poor for large values of r. Researchers in the field 
have also introduced other distribution functions to fit word frequencies. 
However, as pointed out by Herdan [11], the word frequency distribution 
functions are characterized by the properties of both combinability and di- 
visibility without altering the essential mathematical characteristics of the 
distribution function. The only distribution functions have these two prop- 
erties are of the compound Poisson type. With this as the starting point, 
Sichel [12] introduced a new family of compound Poisson distribution func- 
tions to fit word frequencies. He [13] also used this family of distribution 
functions to fit sentence-length, which was first considered by Yule [10]. The 
fit in both cases were encouraging. 

A natural question to ask is whether the quantitative studies made in 
linguistics can be carried out in a similar fashion in DNA sequences. In 
particular, we would like to know if the compound Poisson distribution 
functions introduced in the study of quantitative linguistics are universal, in 
the sense that they can be used to study human designed languages such as 
the languages we use everyday as well as the language used by nature — the 
information stored in DNA sequences. We will answer the above question by 
carrying out quantitative studies in DNA sequences using these compound 
Poisson distribution functions. In section II, we give a brief review of the 
family of compound Poisson distribution function used in this paper, which 
was first introduced by Sichel. In section III, we use this family of Poisson 
distribution functions to study the statistical features of the word frequencies 
in DNA sequences. Section IV is a statistical study of the sentence-length 
in DNA sequences. Section V is the discussion and summary. 

2 The Mathematical Model 

In this section, we give a brief introduction of the mathematical model that 
we use throughout this paper. We follow mainly the discussion by Sichel [8]. 

Let the total vocabulary that an author uses in his writing consist of 
V distinct words. For each word in the vocabulary V, there is a long-term 
probability of occurrence tti, 7r2, Try where 7ri(7ry) refers to the word with 
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lowest and highest probabihty respectively. Theoretically, we have 



0<7ri <1 ; ^7ri = l . (3) 

i 

The probability of a specific word to appear r times in a total word count 
of N tokens is given by, 



(f,{r\N) =(^^^ 7r''(l - 7r)^-X7r)d7r . 



(4) 



Since all the VTj's arc small, we may replace the binomial by the Poisson 
distribution function. We can then write A = Ntt and Eq.(2) becomes 

Mr\N) = - / e-^^(A^7r)X7r)d7r = - / e-^A7(A)dA . (3) 
r! Jo r\ Jo 

For mathematical convenience, one can substitute N by infinity in the second 
integral in Eq.(3) since the latter is negligibly small between A'^ and infinity. 

The choice of the mixing distribution ■iI){'k), or /(A), is crucial. A set 
of distribution functions was first suggested in [7]. Later, Sichel expressed 
Good's mixing distribution function as 

1 (2(1 - 0)V2/«0)7 1 

where — oo <7<cxd,0<^<1 and a > are constants and is 
the modified Bessel function of the second kind of order 7. In particular, 
if 7 = — |, /(A) will become the so-called inverse Gaussian distribution 
function, which has applications in many different areas [12]. Substituting 
Eq.(4) into Eq.(3) and perform a Bessel function integration, one will obtain 
the corresponding compound Poisson distribution function 

((i-g)V^r i»e/2)\ ^ 

where r > 0. This three parameter family of discrete distribution functions 
is extremely powerful. A number of known distribution functions such as the 

Poisson, negative binomial, geometric. Yule, Good and Riemann distribution 
functions are all special or limiting forms of Eq.(5). If the parameter 7 is 
made negative in Eq.(5), an entirely new set of discrete distribution functions 
is generated. 
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In general, the parameter a characterizes the frequency behavior for low 
values of r, whereas 9 influences the tail and 7 is important for the entire 
sweep of the distribution function. For the calculation of the individual 
probabilities in Eq.(5), Sichel derived a very useful formula based on the 
following Bessel recursion relation 

K,+i {z) = —K,{z) + {z) . (6) 

z 

Using this recursion relation, one can easily obtain the following recursion 
relation for ^(r) 

0(r) = ^(^:±^)^(r - 1) + - 2) . (7) 

Thus, as one obtains the first two probabilities 0(0) and from Eq.(5), 
it is easy to calculate all other probabilities from Eq.(7). It is clear that the 
word frequency distribution functions start at r = 1. A zero truncation of 
the function in Eq.(5) yields 

</.(r) = [((1 - ef'^r-^K.iail - 6)'/') - K,{a)]-'^^^Kr+^{a) , (8) 

for r > 0. This is the Sichel model for word frequencies in its most general 
form. 

3 Word frequencies in DNA sequences 

In this section, we use the Sichel model to study the word frequencies in DNA 
sequences. In order to adapt the Sichel model to the quantitative study of 
DNA sequences, the concept of word must first be defined. In the case of 
coding regions, the words are the 64 3-tuples which code for the amino acids, 
AAA, AAT, etc. For noncoding regions, the words are however unknown. 
Therefore, it is better to consider the word length n as a free parameter and 
perform analyses for n from say, 3 to 8 as was done in [1]. The number of 
n-tuples will be 4". Thus, for n = 6, the number of the 6-tuples will be 
4096. To obtain the word frequency for each n-tuple, we will start from the 
first base pair of the DNA sequence that is under study and progressively 
shift by 1 base with a window of length n. For a DNA sequence containing 
L base pairs, the total number of words will be L — n + 1. 
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To avoid any bias in DNA sequence selection, we performed analysis of 
13 sequences of eukaryotes mammals (GenBank accession codes are HSMH- 
CAPG, HUMGHCSA, HUMHBB, HUMHDABCD, HUMHPRTB, HUM- 
MMDBC, HUMNEUROF, HUMRETBLAS, HUMTCRADCV, HUMVIT- 
DBP, MMBGCXD, MUSTCRA, RATCRYG), 3 sequences of invertebrate 
(CEC07A9, CELTWIMUSC, DROABDB), the yeast chromosome III se- 
quence (SCCHRIII), 10 sequences of eukaryotic viruses (ASFV55KB, HEICG, 
HEHCMVCG, HEVZVXX, HSIULR, HSECOMGEN, HSGEND, IHICG, 
VACCG, VVCGAA), 7 sequences of bacteria (BSGENR, ECOllOK, ECOHU47, 
ECOUW82, ECOUW85, ECOUW87, ECOUW89), and 2 sequences of phage 
(LAMCG, MLCGA). 

In Sichel's model, (j)(r) is the fraction of the total number of words with 
a frequency r of appearance in the article under study. For example, 0(1) is 
the fraction of words among the total number of words used that appear once 
in the article. To implement our analysis using the Sichel model, we first 
record the total number (N) of words (n-tuples) among the total number 
of possible words that are used in the DNA sequence under study. For each 
frequency of appearance, we record the total number {N(r)) of words (n- 
tuples) that have such a frequency r of appearance in that DNA sequence. 
We divide that number by A'' and call it 0(r) and then plot (f){r) against r. 

test is used to obtain the best fit of the data against (?!>(r) in Eq.(5). 

Table I is the result of the test of the DNA sequences chosen from 
different groups of species using the Sichel model. For each of the DNA 
sequences, we give the result for the 6-tuplcs and 7-tuples. We give the best 
values for 7, a and 6 in each case, which is determined by minimizing the 

value, 

. Hi 

where Ni^s are the observed values and n^'s arc the theoretical values. It 
is interesting to note that in the case of the 7-tuples, most of the DNA 
sequences can be fit using an inverse Gaussian distribution (i.e. 7 = —5). 
Fig.l is an illustration of the test of some of the DNA sequences chosen 
from Table I. Wc choose one sequence from each group of DNA sequences 
that we studied. In most cases, (f){r) can be fit reasonably well. 
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4 Sentence Length in DNA sequences 



To study the sentence length of DNA sequences, one needs to define what 
a sentence is. In linguistics, it is easy to identify what a sentence is. In 
the case of DNA sequences, what exactly a sentence should be is unknown. 
One can, for example, identify the word clusters in [3] as DNA sentences. In 
our study here, we proceed with the following strategy. We divide a DNA 
sequence into segments in STich a way as to maximize the compositional 
divergence between the resulting DNA domains until a stopping criterion 
is reached. We then identify each segment as a sentence in the DNA se- 
quence. In our analysis, we use the segmentation method which employs 
the Jensen- Shannon divergence measure [15] to study the bacterial DNA se- 
quence, EcollOK, as an example. We should remind our reader that one can 
use any other segmentation methods to study the sentence length in DNA 
sequences. The number of segments {N(r)) of length (r) is then recorded. 
We again divide N{r) by the total number of segments to obtain the rel- 
ative frequency distribution of segments for r and plot it against r, which 
is shown in Fig.2 [13]. In Fig.2, we present the results for dr = 0.55,0.60 
and 0.65 which correspond to significance level about 76%, 79% and 81% 
respectively. All of the chosen drS follow an inverse Gaussian distribution. 
Table II is the result of the test of the EcollOK DNA sequences using 
the Sichel model. 



5 Summary and Discussion 

In the above, we have introduced a family of compound Poisson distribution 
functions to the statistical study of DNA sequences. We have used the com- 
pound Poisson distribution functions to fit both the n-tuple and segment 
distributions of the DNA sequences. In both cases, we have obtained rea- 
sonable fits, both the shape and the normalization. More interestingly, the 
relative frequency distribution of n-tuples and the compositional segmen- 
tation study follow the inverse Gaussian distribution among different types 
of species and the normalization of (/){r) of both the word frequencies and 
compositional segmentation comes out naturally from the linguistic model 
without ambiguity. 

In the early linguistics feature stTidies [4] of DNA sequences, people have 
plotted the relative occurrence of DNA words against rank and found power 
law behaviors. It was later shown that [6] the Zipf distribution indeed fits 
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very poorly in many cases and the Yule (eq.(2)) distribution can give a much 
better fit. However, as we have mentioned earlier, the Yule distribution was 
suggested primarily for the reason of convergence and most of the fits are 
made when 9 approaches one. This is also true in the case of [6] though the 
fits are better than that of [4]. On the other hand, the distribution suggested 
by Sichel has a much rigorous base. It is based on the fact that only com- 
pound Poisson distribution functions have the properties that characterize 
the word frequency distribution function in linguistics and has its rigorous 
derivation. This model incorporates the characteristic features of linguistics 
and thus a fit using Sichel's model should be of more theoretical interest. 
We have indeed shown that one can obtain reasonably good fits using this 
model. 

As mentioned in the above section, one would want to study the compo- 
sitional segmentation by using segmentation methods other than the one we 
used here. One of such methods is suggested in [17]. In [17], new different 
stopping criteria for segmenting DNA sequences are introduced. The size 
of the segments are plotted against the rank and the result indicates that 
the Zipf plot is different for different segmentation methods. It would be 
interesting to see how the Sichel model can fit for different segmentation 
methods. This would then establish the validity of using the Sichel model 
in the linguistic study of DNA sequences. It would be inappropriate to con- 
clude that our results imply that DNA sequences have any resemblance to 
a natural language. However, it does suggest that DNA sequences are not 
random sequences and they could possibly have subsequence structures [19]. 

6 Acknowledgment 

The work of K.L. Ng is support by the ROC NSC grant NSC 91-2626-E275- 
001, and the Academia Sinica short term visiting program. 



8 



References 

Correspondence author: albertOmail. ltc.edu. tw, address after 
August 1, 2003, Department of Bioinf ormatics , Taichung 
Healthcare and Management University No. 500, Lioufeng 
Road, Wufeng Shiang, Taichung, Taiwan 413, R.O.C. 

[*] Electronic address: spli@phys.sinica.edu.tw 

[1] M. Ortuno et.al, Europhys. Lett. 57, 759 (2002). 

[2] P. Chaudhuri and S. Das, J. Biosci. 27, 1 (2002). 

[3] H.J. Bussemaker et.al., PNAS, 97(18), 10096 (2000). 

[4] R.N. Mantegna et.al, Phys. Rev. Lett. 73, 3169 (1994). 

[5] G.K. Zipf, Selected Studies of the Prineiple of Relative Frequency 
in Language, (Harvard University Press, Cambridge, Massachusetts, 
1932). 

[6] G.K. Zipf, Human Behavior and the Principle of least Effort, (Addison- 
Wesley, 1949). 

[7] G.K. Zipf, The Psycho-Biology of language, An introduction to Dy- 
namic Philology, (MIT Press, Cambridge, Massachusetts, 1965). 

[8] M.Yu. Borodovsky and S.M. Gusein-Zade, J. Biomolecular Structure 
and Dynamics 6, 1001 (1989). 

[9] G.U. Yule, A Statistical Study of Vocabulary, (Cambridge University 
Press, Cambridge, England, 1944). 

[10] LJ. Good, Biometrika 40, 237 (1953). 

[11] G. Herdan, Applied Statistics, 10, 65 (1961). 

[12] H.S. Sichel, J. American Statistical Association 70, 542 (1975). 

[13] H.S. Sichel, J.R. Statist. Soc. A, 137, 25 (1974). 

[14] G.U. Yule, Biometrika 30, 363 (1939). 



9 



[15] J. Lin, IEEE Trans. Info. Theor. 37, 145, (1991). P. Bernaola-Galvan, 
R. Roman-Roldan and J.L. Oliver, Pliys. Rev. E 53, 5181 (1996). I. 
Grosse et al., Phys. Rev. E 65, 041905 (2002). 

[16] V. Seshadri, The inverse Gaussian distribution : statistical theory and 
applications, (New York : Springer Verlag, 1999). 

[17] K.L. Ng, M.C. Chung, and S.P. Li, (to appear in Physica A, 2003, 
Proceedings for StatPhys-Taiwan 2002). 

[18] W. Li, Phys. Rev. Lett. 86, 5815 (2001). 

[19] W. Li, Complexity 3, 33 (1977). 



10 



Table 1. Summary of the frequency distributions for 6-tuples and 7-tuples 





6-tuples 


7-tuples 


Species 


GenBankcode 




I 


a 


e 


/ 


a 


e 


Mammal 


HSMHCAPG 




1.7 


0.4 


0.91 


-0.5 


3.2 


0.91 




HUMGHCSA 


0.48 


1.3 


0.3 


0.93 


-0.5 


3.5 


0.91 




HUMHBB 




1.1 


0.4 


0.95 


-0.5 


3.9 


0.91 




HUMHDABCD 




1.3 


2.2 


0.91 


-0.5 


2.5 


0.91 




HUMHPRTB 




1.3 


0.4 


0.91 


-0.5 


2.8 


0.91 




IIUMMMDBC 




0.5 


3.9 


0.95 


-0.5 


2.4 


0.93 




HUMNEUROF 




0.9 


0.6 


0.97 


0.3 


2.3 


0.91 




HUMRETBLAS 




1.1 


0.1 


0.93 


0.5 


2.2 


0.93 




HUMTCRADCV 




0.9 


1.3 


0.97 


0.3 


2.5 


0.91 




HUMVITDBP 




1.1 


0.4 


0.93 


-0.5 


3.1 


0.91 




MMBGCXD 




1.1 


0.4 


0.93 


-0.5 


3.3 


0.91 




MUSTCRA 




1.3 


0.4 


0.95 


-0.5 


4.4 


0.91 




RATCRYG 




1.3 


0.4 


0.91 


-0.5 


2.7 


0.91 


Invertebrate 


CEC07A9 




0.5 


4.5 


0.93 


-0.5 


2.5 


0.91 




CELTWIMUSC 


0.62 


0.5 


3.9 


0.93 


-0.7 


2.5 


0.91 




DROABDB 




1.5 


4.5 


0.91 


-0.5 


3.4 


0.91 


Yeast Chrlll 


SCCHRIII 




2.3 


2.1 


0.97 


0.5 


4.5 


0.95 


Virus 


ASFV55KB 




0.5 


4.5 


0.91 


-0.7 


2.5 


0.91 




HEICG 




0.5 


4.5 


0.97 


-0.5 


3.2 


0.97 




HEHCMVCG 




2.9 


2.5 


0.95 


0.5 


4.5 


0.93 




HEVZVXX 




3.1 


2.5 


0.91 


-0.5 


4.5 


0.93 




HSIULR 




0.5 


3.8 


0.97 


-0.5 


3.1 


0.95 




HSECOMGEN 




2.5 


4.5 


0.93 


0.5 


4.0 


0.91 




HSGEND 




0.9 


0.3 


0.97 


0.7 


0.1 


0.91 




IHICG 




2.5 


0.1 


0.93 


-0.5 


4.5 


0.95 




VACCG 




1.3 


1.0 


0.97 


-0.5 


4.2 


0.97 




VVCGAA 




1.3 


0.4 


0.97 


1.3 


0.8 


0.97 


Bacteria 


BSGENR 


0.18 


1.5 


4.5 


0.93 


-0.5 


4.1 


0.91 




ECOllOK 


0.24 


2.7 


0.2 


0.91 


-0.5 


4.5 


0.91 




ECOHU47 


0.26 


1.5 


4.5 


0.91 


-0.5 


3.6 


0.91 




ECOUW82 




2.5 


2.8 


0.93 


0.5 


3.2 


0.91 




ECOUW85 




2.1 


2.5 


0.91 


-0.5 


4.1 


0.91 




ECOUW87 




2.3 


2.5 


0.91 


0.5 


4.4 


0.91 




ECOUW89 




3.1 


0.4 


0.93 


0.5 


4.5 


0.91 
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Table 1. Summary of the frequency distributions for 6-tuples and 7-tuples 





6-tuples 


7-tuples 


Species 


GenBankcode 




7 


a 


e 


7 


a 


e 


Bacteriaphage 


LAMCG 




0.5 


4.5 


0.91 


-0.7 


2.5 


0.91 




MLCGA 




1.1 


0.2 


0.93 


-0.5 


2.8 


0.91 



Table 2. The frequency distribution for segmentation 



dr 


7 


a 


9 


0.50 


-0.5 


4.2 


0.97 


0.60 


-0.5 


4.1 


0.99 


0.65 


-0.5 


4.5 


0.99 
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Fig la. The (|)( r ) vs r plot for the mammal sequence HUMHDABCD, where y=1.30, a=2.20 
and 0=0.91 and y=-0.5, a=2.50 and 9=0.91 for 6 -tuples and 7 -tuples word length cases 
respectively. 
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Fig lb. The (|)( r ) vs r plot for the invertebrate sequence CELTWIMUSC, where y=0.50, 
a=3.90 and 9=0.93 and y=-0.7, a=2.50 and 6=0.91 for 6-tuples and 7-tuples word length cases 
respectively. 
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Fig Ic. The (|)( r ) vs r plot for the yeast chromosome III sequence , where y=2.30, a=2.10 and 
9=0.97 and y=0.5, a=4.50 and 0=0.95 for 6-tuples and 7 -tuples word length cases respectively. 
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Fig Id. The <t'( r ) vs r plot for the virus sequence HSIULR, where Y=0.50, a=3.80 and 
6=0.97 and Y=-0.5, a=3.10 and 6=0.95 for 6-tupIes and 7 -tuples word length cases respectively. 



16 





—•— 6BSGENR^" gl50a450t9: 




0.05 
0.04 
^ 0.03 
0.02 
0.01 












20 40 60 

r 



—*— 7BSGENR^" gm50a410a9 








0.12 


^ 




0.1 






S 0.08 
n 0.06 






0.04 






0.02 










10 20 30 4C 
r 





Fig le. The (])( r ) vs r plot for the bacteria sequence BSGENR, where Y=1.50, a=4.50 and 
6=0.93 and Y=-0.5, a=4.10 and 9=0.91 for 6-tuples and 7 -tuples word length cases respectively. 
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Fig. 2(b) 
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Fig.2(c) 

Fig. 2. The <|)( r ) vs r plot for the (a) dr = 0.55, (b) dr= 0.60 and (c) dr=0.65 cases. 
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